Virtual Manipulation of Datasets Tutorial

In this tutorial, we will be using two Dataset types, namely Sampled Dataset and Transposed Dataset. This tutorial assumes familiarity with Procedures. We suggest going through the Procedures and Functions Tutorial beforehand.

To run the examples below, we created a toy dataset using descriptions of machine learning concepts from Wikipedia. Our dataset is made up of two columns: the first contains the name of a machine learning concept, and the second contains its description.

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.


In [8]:
from pymldb import Connection
mldb = Connection("http://localhost")

Loading the Dataset

A powerful feature of MLDB allows us to execute SQL code while loading data. Here, the tokenize function splits each description into (lowercased) individual words, removing words shorter than 4 characters. Notice that we have chosen a dataset of type sparse.mutable since we do not know ahead of time the number of columns for each row.


In [9]:
print mldb.put('/v1/procedures/import_ML_concepts', {
        "type":"import.text",
        "params": 
        {
            "dataFileUrl":"http://public.mldb.ai/datasets/MachineLearningConcepts.csv",
            "outputDataset":{
                "id":"ml_concepts", # can be accessed later using id
                "type": "sparse.mutable" # sparse mutable dataset is needed since we tokenize words below
            },
            "named": "Concepts", #row name expression for output dataset
            "select": 
                """ 
                    tokenize(
                        lower(Text), 
                        {splitChars: ' -''"?!;:/[]*,().',  
                        minTokenLength: 4}) AS *
                """, # within the tokenize function:
                        # lower case
                        # leave out punctuation and spaces
                        # keep only words of 4 or more characters
            "runOnCreation": True
        }
    }
)


<Response [201]>
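
To get a feel for what the tokenize call above produces, we can run it on a short literal string. This is a minimal sketch with a made-up sentence and a simplified set of splitChars; the output is one column per token, with the number of times it occurs:

mldb.query(
    """
    SELECT tokenize(
               lower('Machine Learning: learning from data, data, data!'),
               {splitChars: ' -:,!', minTokenLength: 4}) AS *
    """
)

This should return a single row with columns such as machine, learning (count 2), from and data (count 3), mirroring what the import procedure did for each Wikipedia description.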

A quick look at the data

We can use the Query API to get the data into a Pandas DataFrame and take a quick look at it. You will notice that certain cells have a 'NaN' value, as seen below. This is because the dataset is sparse: not every word appears in the description of every concept. Those missing values are represented as NaNs in a Pandas DataFrame.


In [10]:
mldb.query("SELECT * FROM ml_concepts LIMIT 5")


Out[10]:
addition algorithm algorithms also analysis analyze applications approach assigns associated ... popularized provide rather recurrent serve stored systems threshold understanding wrong
_rowName
Support vector machine 1 2 1 1 1 1 1 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Logistic regression NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Deep belief network NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Restricted boltzmann machines NaN 1 2 NaN NaN NaN 1 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Hopfield network NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN ... 1 1 1 1 1 1 1 1 1 1

5 rows × 286 columns

Exploring the data

Let's count the number of words (of 4 or more characters) used to describe each concept.


In [11]:
mldb.query(
    """
    SELECT horizontal_sum({*}) AS count
    FROM ml_concepts
    """
)


Out[11]:
count
_rowName
Support vector machine 149
Logistic regression 99
Deep belief network 98
Restricted boltzmann machines 124
Hopfield network 53
Naive bayes classifier 165
Boltzmann machine 89
Autoencoder 31
Artificial neural network 40

Taking the transpose of data

There are two ways to take the transpose of a dataset:

1. using a transposed dataset
2. using a transpose SQL function

In [12]:
print mldb.put("/v1/datasets/transposed_concepts", {
        "type": "transposed",
        "params": 
        {
            "dataset": {"id":"ml_concepts"}
        }
    }
)


<Response [201]>

The code above transposed the data and created the transposed_concepts dataset. Now that we have transposed our data, we can easily see which words appear in the most Machine Learning concept descriptions. In the query below, we join the results obtained from the transposed_concepts dataset with those of the transpose inline function to compare the two outputs.


In [13]:
mldb.query(
    """
        SELECT 
            horizontal_count({a.*}) AS top_words_transp_dataset,
            horizontal_count({b.*}) AS top_words_transp_function
        NAMED a.rowName()
        FROM transposed_concepts AS a
        JOIN transpose(ml_concepts) AS b
            ON a.rowName() = b.rowName()
            ORDER BY top_words_transp_dataset DESC
            LIMIT 10
    """
)


Out[13]:
top_words_transp_dataset top_words_transp_function
_rowName
learning 7 7
with 7 7
neural 6 6
machine 6 6
training 5 5
used 5 5
machines 5 5
network 5 5
model 5 5
which 4 4

While both methods yield the same results, the transposed dataset keeps the output available in memory under its own id. This may be useful if you want to use the transpose in multiple steps of your data analysis. In cases where you don't need it more than once, you can simply use the inline syntax, as in the sketch below.
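
For example, the word frequencies above can be computed against the inline transpose alone, without creating a named dataset first (a minimal sketch equivalent to the "a" side of the join above):

mldb.query(
    """
    SELECT horizontal_count({*}) AS top_words
    FROM transpose(ml_concepts)
    ORDER BY top_words DESC
    LIMIT 10
    """
)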

Taking a sample of data

There are two ways to take a sample of a dataset:

1. using a sampled dataset 
2. using a sample SQL function

In [14]:
print mldb.put("/v1/datasets/sampled_tokens", {
        "type": "sampled",
        "params": 
        {
            "rows": 10,
            "withReplacement": False,
            "dataset": {"id":"transposed_concepts"},
            "seed": 0
        }
    }
)


<Response [201]>

In [15]:
mldb.query("SELECT * FROM sampled_tokens")


Out[15]:
Naive bayes classifier Artificial neural network Boltzmann machine Restricted boltzmann machines Support vector machine Deep belief network Logistic regression Hopfield network
_rowName
maximum 1 NaN NaN NaN NaN NaN NaN NaN
nervous NaN 1 NaN NaN NaN NaN NaN NaN
random NaN NaN 1 NaN NaN NaN NaN NaN
studied 1 NaN NaN NaN NaN NaN NaN NaN
neurons NaN NaN NaN 1 NaN NaN NaN NaN
symmetric NaN NaN NaN 1 NaN NaN NaN NaN
marked NaN NaN NaN NaN 1 NaN NaN NaN
such 1 NaN NaN NaN NaN 1 1 NaN
hopfield NaN NaN 1 NaN NaN NaN NaN 4
only NaN NaN NaN NaN 1 NaN NaN NaN

The above code is similar to using the following sample SQL function:

mldb.query(
    """
    SELECT * 
    FROM sample(
        transposed_concepts, 
        {
            rows: 10,
            withReplacement: False
        }
    )                      
    """
)

Again, the two methods provide the same kind of result: a random sample of rows. Since sampling is random, the exact rows may differ from run to run unless a seed is specified, but either form leaves you free to choose how best to manipulate your data.

Using datasets vs. SQL functions

As seen above, the two different ways to either transpose or sample data are equivalent. We recommend using Dataset types rather than SQL functions when the resulting dataset will be reused or queried again later in the program.
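
For instance, the transposed_concepts and sampled_tokens datasets created above remain available by id, so they can be queried again at any later step. As a quick sketch (mldb.get on the /v1/datasets route simply lists the datasets currently known to MLDB):

# the persisted datasets can be reused later in the analysis
mldb.query("SELECT * FROM sampled_tokens LIMIT 3")

# list the datasets currently available
print mldb.get('/v1/datasets')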

As seen in this tutorial, MLDB allows you to manipulate data virtually in multiple ways. This grants greater flexibility when constructing models and analyzing complex data.

Where to next?

Check out the other Tutorials and Demos.